sparse autoencoder
- North America > United States > Massachusetts > Hampshire County > Amherst Center (0.04)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Singapore (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- North America > United States > Texas (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Leisure & Entertainment (0.93)
- Media > Television (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
Meet the new biologists treating LLMs like aliens
By studying large language models as if they were living things instead of computer programs, scientists are discovering some of their secrets for the first time.

How large is a large language model? Think about it this way. In the center of San Francisco there's a hill called Twin Peaks from which you can view nearly the entire city. Picture all of it--every block and intersection, every neighborhood and park, as far as you can see--covered in sheets of paper. Now picture that paper filled with numbers. That's one way to visualize a large language model, or at least a medium-size one: printed out in 14-point type, a 200-billion-parameter model, such as GPT-4o (released by OpenAI in 2024), could fill 46 square miles of paper--roughly enough to cover San Francisco.
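The article's claim is easy to sanity-check with back-of-envelope arithmetic. A minimal sketch (the parameter count and area are from the article; the implied area per printed number is derived, not stated there):

```python
# Back-of-envelope check of the claim: 200 billion parameters,
# printed in 14-point type, cover roughly 46 square miles.
SQ_METERS_PER_SQ_MILE = 1609.344 ** 2            # (meters per mile) squared
params = 200e9                                   # 200-billion-parameter model
claimed_area_m2 = 46 * SQ_METERS_PER_SQ_MILE     # ~1.19e8 square meters

# Implied footprint of each printed parameter, in square centimeters
area_per_param_cm2 = claimed_area_m2 / params * 1e4
print(f"implied area per printed number: {area_per_param_cm2:.1f} cm^2")
```

That works out to about 6 cm^2 per number, which is plausible for a multi-digit value in 14-point type with spacing.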
- North America > United States > California > San Francisco County > San Francisco (0.44)
- Pacific Ocean > North Pacific Ocean > San Francisco Bay > Golden Gate (0.04)
- North America > United States > Massachusetts (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- Health & Medicine (0.47)
- Media (0.47)
Interpreto: An Explainability Library for Transformers
Poché, Antonin, Mullor, Thomas, Sarti, Gabriele, Boisnard, Frédéric, Friedrich, Corentin, Claye, Charlotte, Hoofd, François, Bernas, Raphael, Hudelot, Céline, Jourdan, Fanny
Interpreto is a Python library for post-hoc explainability of HuggingFace text models, from early BERT variants to LLMs. It provides two complementary families of methods: attributions and concept-based explanations. The library connects recent research to practical tooling for data scientists, aiming to make explanations accessible to end users. It includes documentation, examples, and tutorials. Interpreto supports both classification and generation models through a unified API. A key differentiator is its concept-based functionality, which goes beyond feature-level attributions and is uncommon in existing libraries. The library is open source; install via pip install interpreto. Code and documentation are available at https://github.com/FOR-sight-ai/interpreto.
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
Circuits, Features, and Heuristics in Molecular Transformers
Varadi, Kristof, Marosi, Mark, Antal, Peter
Transformers generate valid and diverse chemical structures, but little is known about the mechanisms that enable these models to capture the rules of molecular representation. We present a mechanistic analysis of autoregressive transformers trained on drug-like small molecules to reveal the computational structure underlying their capabilities across multiple levels of abstraction. We identify computational patterns consistent with low-level syntactic parsing and more abstract chemical validity constraints. Using sparse autoencoders (SAEs), we extract feature dictionaries associated with chemically relevant activation patterns. We validate our findings on downstream tasks and find that mechanistic insights can translate to predictive performance in various practical settings.
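The sparse-autoencoder step the abstract mentions can be sketched generically. This is a minimal TopK SAE forward pass in numpy with random stand-in weights, not the paper's implementation: activations are encoded into an overcomplete dictionary, only the k largest pre-activations are kept, and the sparse code is decoded back:

```python
import numpy as np

# Minimal TopK sparse autoencoder forward pass (illustrative only).
# d_model-dim activations -> d_dict sparse features -> reconstruction.
rng = np.random.default_rng(0)
d_model, d_dict, k = 64, 256, 8

W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_enc = np.zeros(d_dict)

def topk_sae(x, k=k):
    """Encode, keep only the k largest pre-activations, decode."""
    pre = x @ W_enc + b_enc
    thresh = np.partition(pre, -k)[-k]          # k-th largest pre-activation
    # zero out everything below the top-k, then apply ReLU
    z = np.where(pre >= thresh, np.maximum(pre, 0.0), 0.0)
    return z, z @ W_dec

x = rng.normal(size=d_model)                    # a residual-stream activation
z, x_hat = topk_sae(x)
print("active features:", int((z > 0).sum()))   # at most k
```

Feature dictionaries like `W_dec` are then inspected for rows whose activations line up with chemically meaningful token patterns.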
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Materials > Chemicals > Commodity Chemicals (0.46)
Unveiling Latent Knowledge in Chemistry Language Models through Sparse Autoencoders
Cohen, Jaron, Hasson, Alexander G., Tanovic, Sara
Since the advent of machine learning, interpretability has remained a persistent challenge, becoming increasingly urgent as generative models support high-stakes applications in drug and material discovery. Recent advances in large language model (LLM) architectures have yielded chemistry language models (CLMs) with impressive capabilities in molecular property prediction and molecular generation. However, how these models internally represent chemical knowledge remains poorly understood. In this work, we extend sparse autoencoder techniques to uncover and examine interpretable features within CLMs. Applying our methodology to the Foundation Models for Materials (FM4M) SMI-TED chemistry foundation model, we extract semantically meaningful latent features and analyse their activation patterns across diverse molecular datasets. Our findings reveal that these models encode a rich landscape of chemical concepts. We identify correlations between specific latent features and distinct domains of chemical knowledge, including structural motifs, physicochemical properties, and pharmacological drug classes. Our approach provides a generalisable framework for uncovering latent knowledge in chemistry-focused AI systems. This work has implications for both foundational understanding and practical deployment, with the potential to accelerate computational chemistry research.
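The correlation analysis described here (latent features vs. domains of chemical knowledge) can be sketched as a simple screen for property-aligned features. Everything below is a synthetic stand-in, not the paper's data or code; a planted signal plays the role of a logP-tracking feature:

```python
import numpy as np

# Hedged sketch: correlate each SAE latent feature's activations with a
# molecular property across a dataset. Feature matrix and property
# values are synthetic stand-ins with one correlated feature planted.
rng = np.random.default_rng(1)
n_mols, n_feats = 500, 32

logp = rng.normal(size=n_mols)              # stand-in property (e.g. logP)
feats = rng.normal(size=(n_mols, n_feats))  # stand-in SAE activations
feats[:, 3] += 2.0 * logp                   # plant one property-tracking feature

# Pearson correlation of every latent feature with the property
f = (feats - feats.mean(0)) / feats.std(0)
p = (logp - logp.mean()) / logp.std()
corr = f.T @ p / n_mols

best = int(np.argmax(np.abs(corr)))
print(f"most property-aligned feature: {best} (r={corr[best]:.2f})")
```

In practice the same screen would run over real SAE activations and measured or computed properties, with the strongest correlations inspected by hand.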
- South America > Brazil (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering
Yeon, Jehyeok, Cinus, Federico, Wu, Yifan, Luceri, Luca
Large language models (LLMs) face critical safety challenges, as they can be manipulated to generate harmful content through adversarial prompts and jailbreak attacks. Existing defenses are typically either black-box guardrails that filter outputs or internals-based methods that steer hidden activations by operationalizing safety as a single latent feature or dimension. While effective for simple concepts, this single-feature assumption is limiting, as recent evidence shows that abstract concepts such as refusal and temporality are distributed across multiple features rather than isolated in one. To address this limitation, we introduce Graph-Regularized Sparse Autoencoders (GSAEs), which extend SAEs with a Laplacian smoothness penalty on the neuron co-activation graph. Unlike standard SAEs, which assign each concept to a single latent feature, GSAEs recover smooth, distributed safety representations as coherent patterns spanning multiple features. We empirically demonstrate that GSAE enables effective runtime safety steering, assembling features into a weighted set of safety-relevant directions and controlling them with a two-stage gating mechanism that activates interventions only when harmful prompts or continuations are detected during generation. This approach enforces refusals adaptively while preserving utility on benign queries. Across safety and QA benchmarks, GSAE steering achieves an average 82% selective refusal rate, substantially outperforming standard SAE steering (42%), while maintaining strong task accuracy (70% on TriviaQA, 65% on TruthfulQA, 74% on GSM8K). Robustness experiments further show generalization across the LLaMA-3, Mistral, Qwen, and Phi families and resilience against jailbreak attacks (GCG, AutoDAN), consistently maintaining >= 90% refusal of harmful content.
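The Laplacian smoothness penalty the abstract describes is a standard graph-regularization term, z^T L z, applied here to SAE codes over a neuron co-activation graph. A minimal sketch with a toy adjacency matrix standing in for co-activation counts (not the paper's code):

```python
import numpy as np

# Sketch of the graph-regularized term: z^T L z on SAE codes, where
# L = D - A is the Laplacian of a neuron co-activation graph.
rng = np.random.default_rng(2)
n = 6
A = rng.random((n, n))
A = (A + A.T) / 2                  # symmetric toy co-activation weights
np.fill_diagonal(A, 0.0)
L = np.diag(A.sum(1)) - A          # graph Laplacian D - A

def smoothness(z):
    # equals 0.5 * sum_ij A_ij (z_i - z_j)^2: small when connected
    # (co-activating) features carry similar values
    return float(z @ L @ z)

z_smooth = np.ones(n)              # identical values -> zero penalty
z_rough = rng.normal(size=n)
print(smoothness(z_smooth), smoothness(z_rough))
```

Adding this term to the SAE loss pushes co-activating neurons toward similar code values, which is what lets a concept spread coherently over several features instead of collapsing into one.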
- North America > United States > California (0.14)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
He, Zirui, Jin, Mingyu, Shen, Bo, Payani, Ali, Zhang, Yongfeng, Du, Mengnan
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces. We employ sparse autoencoders (SAEs) to obtain sparse latent representations that aim to disentangle semantic attributes from model activations. Then we train linear classifiers to identify a small subspace of task-relevant dimensions in latent representations. Finally, we learn supervised steering vectors constrained to this subspace, optimized to align with target behaviors. Experiments across sentiment, truthfulness, and political polarity steering tasks with multiple LLMs demonstrate that our supervised steering vectors achieve higher success rates with minimal degradation in generation quality compared to existing methods. Further analysis reveals that a notably small subspace is sufficient for effective steering, enabling more targeted and interpretable interventions. Our implementation is publicly available at https://github.com/Ineedanamehere/SAE-SSV.
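The three-step recipe in the abstract (sparse codes, a linear probe to find task-relevant dimensions, a steering vector confined to that subspace) can be sketched with synthetic data. The probe here is a simple class-mean-difference direction, a stand-in for the trained linear classifiers the paper describes:

```python
import numpy as np

# Sketch of supervised steering in a sparse representation space.
# Data is synthetic: dims 0-4 are planted as the "task-relevant" ones.
rng = np.random.default_rng(3)
d = 128
pos = rng.normal(size=(200, d))
pos[:, :5] += 1.5                        # codes showing the target behavior
neg = rng.normal(size=(200, d))          # codes without it

# Step 2: linear probe ~ class-mean difference; rank dims by |weight|
w = pos.mean(0) - neg.mean(0)
subspace = np.argsort(-np.abs(w))[:5]    # small task-relevant subspace

# Step 3: steering vector supported only on that subspace
v = np.zeros(d)
v[subspace] = w[subspace]
v /= np.linalg.norm(v)
print("steering dims:", sorted(subspace.tolist()))
```

At inference time, `v` (mapped back through the SAE decoder) would be added to the model's activations; restricting it to a few dimensions is what keeps the intervention targeted.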
- North America > United States (0.14)
- Europe > France (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Asia > Middle East > UAE (0.04)
Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone
Bărbălau, Antonio, Păduraru, Cristian Daniel, Poncu, Teodor, Tifrea, Alexandru, Burceanu, Elena
Sparse Autoencoders (SAEs) are widely employed for mechanistic interpretability and model steering. Within this context, steering is by design performed by decoding altered SAE intermediate representations. In contrast to the existing literature, we put forward an encoder-centric alternative to model steering that demonstrates stronger cross-modal performance. We introduce S&P Top-K, a retraining-free and computationally lightweight Selection and Projection framework that identifies Top-K encoder features aligned with a sensitive attribute or behavior, optionally aggregates them into a single control axis, and computes an orthogonal projection to be applied directly in the model's native embedding space. In vision-language models, it improves fairness metrics on CelebA and FairFace by up to 3.2 times over conventional SAE usage, and in large language models, it substantially reduces aggressiveness and sycophancy in Llama-3 8B Instruct, achieving up to 3.6 times gains over masked reconstruction. These findings suggest that encoder-centric interventions provide a general, efficient, and more effective mechanism for shaping model behavior at inference time than the traditional decoder-centric use of SAEs.

Figure 1: Sample generation demonstrating behavioral steering interventions on Llama 3 8B Instruct prompted to produce a sycophantic opinion. We apply two Sparse Autoencoder (SAE)-based methods to remove sycophancy: the conventional decoder-centric Masked Reconstruction approach and our proposed encoder-centric S&P Top-K protocol. Lower LLM-as-a-judge sycophancy scores indicate superior mitigation of the targeted behavioral pattern. The results illustrate that conventional Masked Reconstruction fails to suppress sycophantic behavior, while our S&P Top-K intervention successfully redirects the model's output, eliminating direct praise, repeatedly deferring endorsement, and ultimately leading the model to employ laudatory language in a sarcastic manner that subverts the original sycophantic intent. The main steps of our approach are highlighted in green. We first employ a selection mechanism to identify relevant SAE features.
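The select-and-project idea can be sketched independently of any real SAE: pick the Top-K encoder feature directions most aligned with a sensitive attribute, aggregate them into one control axis, and project activations onto its orthogonal complement. All weights and alignment scores below are random stand-ins:

```python
import numpy as np

# Hedged sketch of a select-and-project step. Rows of W_enc play the
# role of SAE encoder feature directions in the model's embedding space.
rng = np.random.default_rng(4)
d_model, d_dict, k = 32, 128, 4

W_enc = rng.normal(size=(d_dict, d_model))   # encoder feature directions
attr_scores = rng.normal(size=d_dict)        # stand-in attribute alignment
top = np.argsort(-np.abs(attr_scores))[:k]   # Top-K selection

# Aggregate the selected features into a single unit control axis
axis = (np.sign(attr_scores[top])[:, None] * W_enc[top]).sum(0)
axis /= np.linalg.norm(axis)

def project_out(x):
    # remove the component of x along the control axis
    return x - (x @ axis) * axis

x = rng.normal(size=d_model)
print("residual alignment:", float(project_out(x) @ axis))  # ~0
```

Because the projection acts directly in the model's native embedding space, no decoding pass and no retraining are needed, which is the efficiency argument the abstract makes.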
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Arizona > Pima County > Tucson (0.04)
- North America > Mexico > Gulf of Mexico (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
Mechanistic Interpretability of Antibody Language Models Using SAEs
Haque, Rebonto, Turnbull, Oliver M., Parsan, Anisha, Parsan, Nithin, Yang, John J., Deane, Charlotte M.
Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate an autoregressive antibody language model, p-IgGen, and steer its generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs are sufficient for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
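The steering operation these SAE papers share is simple at its core: clamp one latent feature to a chosen value and add the resulting decoder delta back to the model activation. A minimal sketch with random stand-in weights (not p-IgGen's actual SAE):

```python
import numpy as np

# Minimal sketch of SAE-based activation steering: clamp one latent
# feature and apply the change through the decoder.
rng = np.random.default_rng(5)
d_model, d_dict = 48, 192
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))

def steer(x, feature, value):
    z = np.maximum(x @ W_enc, 0.0)                # ReLU SAE codes
    # delta = (clamped value - current value) times that feature's
    # decoder direction; added straight back into the activation
    delta = (value - z[feature]) * W_dec[feature]
    return x + delta

x = rng.normal(size=d_model)                      # a model activation
x_steered = steer(x, feature=7, value=5.0)
```

Whether this intervention actually changes generation in the intended way is exactly the causal question the abstract raises: a feature can correlate with a concept yet fail to steer it.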
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.28)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)